Porting applications to new high performance parallel and distributed platforms is a challenging task. 
Writing parallel code by hand is time consuming and costly, but this task can be simplified by high level 
languages and would even better be automated by parallelizing tools and compilers. The definition of the HPF 
(High Performance Fortran, based on the data parallel model) and OpenMP (based on the shared memory parallel 
model) standards has offered a great opportunity in this respect. Both provide simple and clear interfaces to 
Motivation 

NAS Parallel Benchmarks (NPB) 
Programming Baseline for NPB (PBN) 
HPF Implementation 
OpenMP Implementation 
Performance Comparison 
Remarks 
High performance computing 

► evolving and expensive 

► code porting costly, time-consuming 

Popularity of MPI 

► high performance and widely supported (portability) 

► but, hard to program, prone to error 

Alternatives 

► computer aided tools and translators 

► data parallel languages 

► parallelizing compilers 

Goal 

► examine the effectiveness of HPF and OpenMP vs. MPI 

► using NPB as a test suite 
Data partition 

► how data are distributed 

► domain decomposition strategy 

Computation distribution 

► independent loops and code sections 

► computation masking 

Data communication 

► when data needed but not available 

... downside 

► no incremental approach 

► low-level, hard to write 
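
The data partition step listed above can be sketched for a 1-D block decomposition. This is a minimal illustration, not code from the benchmarks; the function names `block_range` and `halo_neighbours` are hypothetical, and a balanced split (sizes differing by at most one) is assumed.

```python
def block_range(n, nprocs, rank):
    """Return the half-open index range [lo, hi) owned by `rank`.

    Assumption: a balanced 1-D block split, where the first n % nprocs
    ranks receive one extra element.
    """
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def halo_neighbours(rank, nprocs):
    """Ranks that a 1-D block owner would exchange boundary data with.

    None means there is no neighbour on that side (non-periodic domain).
    """
    left = rank - 1 if rank > 0 else None
    right = rank + 1 if rank < nprocs - 1 else None
    return left, right
```

In a message-passing code, each rank would loop only over its own `[lo, hi)` range and exchange boundary values with the ranks returned by `halo_neighbours` whenever a stencil reaches across a block edge.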
Data parallel language approach 

► parallelization based on data distribution, 
owner-computes rule 

► user-added directives to distribute data and parallelize 
loops 

Strength 

► built on top of a high-level language, easy to program 

► portability from the HPF standard 

Weakness 

► questionable performance due to immaturity of 
compiler technology 

► hidden performance model, hard to track 

► lack of handling irregular computation 
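
The owner-computes rule can be illustrated with a small serial simulation. This is a sketch under stated assumptions: a 1-D BLOCK distribution with block size ceil(n/nprocs), and a three-point update chosen only for illustration; `owner` and `owner_computes_update` are hypothetical names, not part of any HPF compiler or library.

```python
def owner(i, n, nprocs):
    """Rank owning element i under a BLOCK distribution (block size ceil(n/nprocs))."""
    block = -(-n // nprocs)  # ceiling division
    return i // block


def owner_computes_update(a, b, nprocs):
    """Simulate a[i] = b[i-1] + b[i+1] for interior i under owner-computes.

    Each simulated rank executes only the iterations whose left-hand side
    a[i] it owns. Reads of b elements owned by another rank are counted,
    since those are the references that would require communication.
    """
    n = len(a)
    remote_reads = [0] * nprocs
    for rank in range(nprocs):
        for i in range(1, n - 1):
            if owner(i, n, nprocs) != rank:
                continue  # another rank owns a[i], so this rank skips it
            for j in (i - 1, i + 1):
                if owner(j, n, nprocs) != rank:
                    remote_reads[rank] += 1
            a[i] = b[i - 1] + b[i + 1]
    return remote_reads
```

The returned counts hint at the hidden performance model noted above: the communication volume follows from the distribution and the access pattern, but it never appears explicitly in the source program.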
An industry standard for SMP 

► computation based on shared-memory model 

► compiler-directives to parallelize loops and 
independent code sections 

► fork-and-join model 

Strength 

► offers an incremental approach to code parallelization 

► high-level constructs, easy to program 

► portable for SMP, good performance 

Weakness 

► hidden data distribution 

► not for distributed memory system 
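
The fork-and-join structure of a parallel loop can be sketched with Python threads. This only mirrors the control structure: Python threads are not OpenMP threads and give no speedup for this kind of loop, and `parallel_sum_of_squares` is a hypothetical example function.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum_of_squares(values, nthreads=4):
    """Fork workers over contiguous loop chunks, then join and reduce."""
    n = len(values)
    chunk = -(-n // nthreads) if n else 1  # ceiling division; 1 avoids a zero step
    bounds = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]

    def work(lo_hi):
        lo, hi = lo_hi
        partial = 0  # local to each task, playing the role of a private variable
        for i in range(lo, hi):
            partial += values[i] * values[i]
        return partial

    # Fork: the master thread hands one chunk of iterations to each worker.
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        partials = list(pool.map(work, bounds))
    # Join: leaving the `with` block waits for all workers; the master
    # then continues alone and combines the partial results.
    return sum(partials)
```

The serial loop body is unchanged inside `work`, which is the sense in which a directive-based approach allows incremental parallelization, one loop at a time.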
8 problems, 5 class sizes (S, W, A, B, C) 

► derived from CFD applications 

► specified algorithmically, not by source code 

3 pseudo-applications 

► BT independent Block (5x5) Tridiagonal systems 

► SP independent Scalar Pentadiagonal systems 

► LU Lower-Upper symmetric Gauss-Seidel 

5 kernels 

► FT spectral method (FFT) to solve Laplace equation 

► MG MultiGrid method to solve Poisson equation 

► CG Conjugate Gradient method 

► EP random-number generator (Embarrassingly Parallel) 

► IS Integer Sort 
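
As an illustration of the kind of computation behind the CG kernel, here is a generic textbook conjugate gradient solver. It is not the NPB CG specification; the test matrix tridiag(-1, 2, -1) and the names `conjugate_gradient` and `tridiag_matvec` are assumptions chosen for the sketch.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive definite A given as `matvec`."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0 initially
    p = list(r)          # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break        # residual norm small enough
        ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def tridiag_matvec(v):
    """Multiply by the symmetric positive definite matrix tridiag(-1, 2, -1)."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]
```

The matrix-vector product and the dot products dominate the work, and these are the operations a parallel version has to distribute.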
Source code implementation 

► with MPI communication constructs 

► coded in Fortran 77, except IS (C) 

► optimized generically, not for specific machines 

► demonstrate real-world performance for portable user 
codes 

NPB 2.3-serial 

► stripped-down versions of the MPI implementations 

► as staging points for other implementations and for 
performance tests of parallelizing tools/compilers 
What is PBN 

► based on NPB2.3-serial 

► additional modifications 

• real-world user optimization of the serial codes 

• memory optimization in BT and SP 

• hyperplane and pipeline algorithms in LU 

• data-copy improvement in FT and IS 

• more convenient timers 

Why PBN 

► provide the optimized version of NPB2.3-serial 

► make it publicly available 

► distinguish from the official NPB 

► give sample HPF/OpenMP implementations 
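
The hyperplane idea mentioned for LU can be sketched as a wavefront ordering. The assumption here is a lower-triangular sweep in which point (i, j, k) depends on (i-1, j, k), (i, j-1, k) and (i, j, k-1); the update formula is a stand-in chosen for illustration, and `hyperplanes` and `sweep` are hypothetical names, not routines from PBN.

```python
def hyperplanes(nx, ny, nz):
    """Group grid points (i, j, k) by wavefront index i + j + k."""
    planes = [[] for _ in range(nx + ny + nz - 2)]
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                planes[i + j + k].append((i, j, k))
    return planes


def sweep(nx, ny, nz):
    """Lower-triangular sweep performed plane by plane.

    Each point adds 1 to the sum of its three 'lower' neighbours; values
    outside the grid count as 0.
    """
    u = {}
    for plane in hyperplanes(nx, ny, nz):
        # All dependencies of a point lie on the previous plane, so the
        # points within one plane are independent and could run in parallel.
        for (i, j, k) in plane:
            u[i, j, k] = (1
                          + u.get((i - 1, j, k), 0)
                          + u.get((i, j - 1, k), 0)
                          + u.get((i, j, k - 1), 0))
    return u
```

The plane-by-plane order yields the same values as the usual nested i, j, k loops; only the order of traversal changes, which is what exposes the parallelism.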
Starting point 

► benchmarks from PBN-Serial 

• BT, SP, LU, FT, CG, MG 

• excluded EP (for HPF) and IS (for HPF & OpenMP) 

Implementations 

► HPF sample implementation (PBN-H) 

• done by hand 

► OpenMP sample implementation (PBN-O) 

• created by hand with the assistance of parallelizing tools 
Data distribution 

► with ALIGN and DISTRIBUTE directives 

Expressing parallelism 

► F90 style of array expressions 

► FORALL constructs 

► INDEPENDENT directive for loops 

► HPF library intrinsics 

Data redistribution 

► to work around the inability to express multiprocessor 
pipelining and the lack of the REDISTRIBUTE directive 

► needed in BT, SP, and FT 

► extra arrays used to keep the redistributed data 
Parallel loops and sections 

► with “!$OMP PARALLEL DO” and “!$OMP PARALLEL” 

► outermost loops for large granularity and low overhead 

► no consideration of independent code sections 

Variable privatization 

► list local variables in the “PRIVATE()” clause 

► avoid conflict of memory access and false sharing 

Point-to-point synchronization 

► for multiprocessor pipeline implementation in LU 

• with the “!$OMP FLUSH” construct 

Others 

► data distribution based on the first-touch model 

► no need for redistribution, thus, no extra arrays 
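
The 1-D multiprocessor pipeline with point-to-point synchronization can be sketched with Python threads. Assumptions are labeled loudly: semaphores stand in for the shared flag and `!$OMP FLUSH` wait loop, the recurrence a[i][j] = 1 + a[i-1][j] + a[i][j-1] is a stand-in for the LU sweep, and `pipelined_sweep` is a hypothetical name.

```python
import threading


def pipelined_sweep(nrows, ncols, nthreads):
    """Compute a[i][j] = 1 + a[i-1][j] + a[i][j-1] with a 1-D thread pipeline.

    Columns are split into contiguous blocks, one per thread. Thread t may
    start row i only after thread t-1 has finished row i.
    """
    a = [[0] * ncols for _ in range(nrows)]
    chunk = -(-ncols // nthreads)  # ceiling division
    # ready[t] counts rows completed by thread t-1 and not yet consumed by t.
    ready = [threading.Semaphore(0) for _ in range(nthreads)]

    def worker(t):
        lo, hi = t * chunk, min((t + 1) * chunk, ncols)
        for i in range(nrows):
            if t > 0:
                ready[t].acquire()      # wait for the left neighbour's row i
            for j in range(lo, hi):
                up = a[i - 1][j] if i > 0 else 0
                left = a[i][j - 1] if j > 0 else 0
                a[i][j] = 1 + up + left
            if t < nthreads - 1:
                ready[t + 1].release()  # signal the right neighbour

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return a
```

Each thread waits only on its left neighbour, so no global barrier is needed per row; the pipeline fills and drains over the first and last few rows, which is one reason a 1-D pipeline can scale less well than a multi-dimensional partition.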
SGI Origin2000 (distributed shared memory) 

► CPU: 195MHz, 32KB L1 cache, 4MB L2 cache 

► compilers 

• MIPSpro-f77 compiler 7.2.1 

• PGI pghpf-2.4.3 compiler with MPI interface 

► versions tested 

• NPB-MPI, PBN-H and PBN-O 

Cray T3E-1200 (distributed memory) 

► PE: 300 MHz, 128MB 

► compilers 

• Cray-f90 compiler 3.1 

• PGI pghpf-2.4.3 compiler with SHM interface 

► versions tested 

• NPB-MPI and PBN-H 
• Single processor, four different platforms 

• Class A/W problem size 

[Figure: axis label "Execution Time (secs)"] 

• On SGI Origin2000, 195MHz 

• Class A problem size 

[Figure: axis label "Number of CPUs"] 

• On Cray T3E-1200, 300MHz 

• Class A problem size 

[Figure: legend MPI, HPF; axis label "Number of CPUs"] 

Overall, MPI implementation scaling the best 

► multi-dimensional partition 

► good load balance 

OpenMP performing quite well 

► close to MPI in most cases 

► even better in FT, no data transposition 

► but, 1-D multiprocessor pipeline in LU not as good 

► yet to see on larger number of processors 
HPF catching up, but still behind 

► closer to MPI in FT and CG 

► BT and SP closer to MPI on Origin2000, but deviated 
quite a bit on T3E and even flattened out after 32 procs 

► poor performance of MG related to the lack of handling 
irregular computation in HPF 

Serial optimization 

► affects overall performance 

► optimized BT as an example 
Echo back 

► MPI hard to program, OpenMP easy to write 

► lack of HPF performance model still evident 

► multi-level parallelism in OpenMP not quite supported 

Future development 

► maturity of HPF compilers 

► better tools and compilers help ease 

• the writing of MPI programs 

• even useful for OpenMP/HPF programs 

► on our part 

• tests of PBN-H/PBN-O on more platforms 

• program development environment 